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A FINITE ELEMENT BASED P 3 M METHOD FOR TV-BODY 

PROBLEMS 

NATALIE N. BEAMS*, LUKE N. OLSONt, AND JONATHAN B. FREUNDt 


Abstract. We introduce a fast mesh-based method for computing TV-body interactions that 
is both scalable and accurate. The method is founded on a particle-particle-particle-mesh (P 3 M) 
approach, which decomposes a potential into rapidly decaying short-range interactions and smooth, 
mesh-resolvable long-range interactions. However, in contrast to the traditional approach of using 
Gaussian screen functions to accomplish this decomposition, our method employs specially designed 
polynomial bases to construct the screened potentials. Because of this form of the screen, the 
long-range component of the potential is then solved exactly with a finite element method, leading 
ultimately to a sparse matrix problem that is solved efficiently with standard multigrid methods. 
Moreover, since this system represents an exact discretization, the optimal resolution properties of 
the FFT are unnecessary, though the short-range calculation is now more involved than P 3 M/PME 
methods. We introduce the method, analyze its key properties, and demonstrate the accuracy of the 
algorithm. 
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1. Introduction. TV-body interactions arise in a range of applications, includ¬ 
ing molecular dynamics, plasma dynamics, vortex methods, and viscous flow: sys¬ 
tems that are described by a Green’s function solution to the Poisson equation or its 
derivatives. We focus on three-dimensional electrostatic-like 1/R interactions, where 
R is the distance to a particle; this is the simplest kernel in three dimensions and 
well-known for this class of problems. However, the resulting algorithm we describe 
extends to other systems. We consider a periodic domain, which is commonly used to 
model extensive systems, and discuss a straightforward extension to other boundary 
conditions in Section 12.61 Without loss of generality we consider a L 3 cubic unit cell 
containing TV point charges, which has the total electrostatic potential energy 
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where is the electrostatic potential at location x, : of particle i with charge Qi. The 
central challenge in (0) is the computation of the potential. 
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because of the fairly slow 1/R decay rate of the interactions at large distances. 

There are a number of approaches for efficiently evaluating m- The most 
widely used methods are generally classified as either tree-based, such as the fast 
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multipole method (FMM) TT] . or mesh-based (sometimes called “particle-in-cell”), 
such as the particle-particle-particle-mesli (P 3 M) method [TB] and its popular variant, 
the particle-mesh-Ewald (PME) method 0 7]. In the FMM, particles are grouped 
within multipole expansions to provide an accurate representation of their combined 
influence at a distance, thus limiting the number of terms needed to explicitly compute 
the interactions. The resulting algorithm scales with O(N) complexity, although the 
coefficient in this scaling can be large, especially if a high-order multipole expansion 
is required for the desired accuracy m- Efficient implementations are intricate— 
especially in parallel—but demonstrated, and the FMM has been shown to be effective 
as an adaptive three-dimensional algorithm [4|. The method also extends to systems 
with more complicated kernels, such as Stokes flow .28, 20, 25] . 

In comparison, meslr-based methods also reduce the number of explicit calcula¬ 
tions but achieve this by splitting the potential into a rapidly decaying component 
<3> sr , which is accurately calculated with inclusion of only a few short-range interac¬ 
tions, and a smooth part $ sm , which is solved on a mesh covering the domain [IB] . 
It is instructive to view this splitting as the addition and subtraction of strategically 
selected “screening” functions, so that the potential in (11.21) decomposes as 

(i.3) +$r- 
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The particle-mesh-Ewald (PME) method [5] bases this decomposition directly 
on the Ewald summation [8] for m and uses Lagrangian interpolation to move 
between particle locations and the mesh, while the smooth PME (SPME) uses 13- 
spline interpolants, similar to those proposed in the P 3 M method [7]. PME-based 
algorithms use Gaussian screening functions, as illustrated in Figure 11.11 Here, the 
screen is designed to yield a <f> sr that is straightforward to calculate within a prescribed 
cutoff at radius R Cl while the long-range portion of the potential remains smooth. In 
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Fig. 1.1: Introduction of a Gaussian screen to define 11.3b and resulting decomposition of the 
potential of the singular charge. 


PME-based methods, the Gaussian screen yields a <h sm that is accurately solved by 
fast Fourier transforms (FFTs). To do this, the screen is interpolated to a regular 
mesh and the Poisson (or similar) operator is inverted. For computational efficiency 
it is desirable that these screens be as compact as possible and barely resolved on 
the mesh, since this maximizes the decay of the screened potential <f> sr . Fast decay 
allows for a small point-to-point interaction cutoff distance R c , which reduces the 
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number of interactions that need to be explicitly computed for the targeted accuracy. 
The ideal wavenumber resolution of the FFT provides accurate representation of the 
most compact screens possible. The FFT also makes these methods most natural for 
periodic domains, but they can be extended to free space m mm- 

Here we propose a fundamentally new decomposition that is constructed within 
a P 3 M-type framework. The method incorporates screen potentials that are selected 
to yield exact mesh potentials, which has many potential benefits. The screens are 
designed for a mesh and thus have no explicit dependence on problem geometry; this 
suggests complex geometries as well as more general boundary conditions fit naturally 
within this method. In addition, the exact mesh potential recast the problem as sparse 
matrix problem where multigrid methods are known (and shown in Section 4) to be 
effective and scale to high core counts |T[. As a result, since the method does not rely 
on the Fourier resolution for an accurate mesh solution, a global FFT can be avoided, 
which may be beneficial at extreme scales. Indeed, while multigrid methods are 
ultimately latency bound, they do not exhibit the strong dependence on a machine’s 
half-bandwidth, which is a limited factor of using multidimensional FFTs a large core 
counts flOj . 

The new decomposition we propose comes with the cost of representing more 
intricate short-range interactions. The calculations are more involved than the simple 
isotropic point-to-point interactions of PME, but are tractable and more importantly 
local, which contributes to scalability. As we highlight in the following sections, the 
short-range potential also has fast but algebraic decay (up to 1/R 6 in our examples), 
which is less attractive than the exponential decay seen in PME, thus possibly leading 
to more local interactions. 

Improvements to Ewald-type schemes range from coarsening strategies to reduce 
the number of grid points by using a staggered mesh [3] to multilevel approaches [2] 
that yield increased locality in the FFT calculations while resulting in only a small 
increase in total work. Moreover, other methods such as the Multilevel Summation 
Method (MSM) |25l [26lE3H3 take different approach to operator splitting altogether. 

In summary, the goal of this paper is to detail a method the incorporates mesh- 
based screens and to investigate the accuracy of such an approach. In Section [2j we 
develop the mathematical construction of each component. In particular, we detail 
the screen functions that lead to the exact sparse linear system for 4> sm and the local 
evaluation of 4 ,sr . In Section [3] we develop a performance model for the method 
and discuss its implications in a parallel setting. A numerical experiment is shown 
in Section [f] to confirm the accuracy of our method. Additional considerations and 
possible extensions are discussed in Section [5] 

2. Description of method. The Ewald decomposition is often viewed through 
the construction of a screen potential to define the corresponding short-range and 
long-range potentials. The usual PME formulation is consistent with the original 
Ewald decomposition in that it uses a Gaussian screen function 
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This screen, as depicted in Figure [HU yields a short-range potential so that <h sr oc 
erfc(a|x — Xi|)/|x — Xj|, which is straightforward to compute. The resulting mesh 
potential satisfies the Poisson problem, 
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which is then optimally solved using FFTs on a mesh. 

We instead propose screening functions Pi(x ) that are piecewise polynomials of 
order q , as shown in Figure 12.11 The corresponding potential is then solved exactly 
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Fig. 2.1 : Introduction of a polynomial screen and resulting decomposition of the potential of the 
singular charge. 


with (12.21) using a finite element method with basis functions of order p = q + 2. That 
is, the potential is represented exactly in the finite element space, making the optimal 
resolution provided by an FFT-based solve unnecessary. 

Next, we describe the details of the method, following the four basic steps of 
P 3 M methods: assignment of charges to the mesh, solving for the smooth potential 
on the mesh, transferring the potential back to the charge locations, and calculating 
the point-to-point (short-range) interactions. A high-level synopsis of the algorithm 
is described in Algorithm [Q to illustrate the structural pieces of our approach. 


Algorithm 1: Polynomial Screen Method for Calculating Potential 
Input: A mesh of elements e ; and a group of point charges Qi 
Return: Potential at locations of charges 

for each charge Qi {charge assignment, Section \__ __[ 

place Qi in element 

solve for screen t nrm or reran 

for each element e 7 

if element e 7 G surface 

I adjust boundary conditions as necessary {see reran 

apply charge assignment operator to form p m {see rem i 

perform multigrid solve of —V 2, F sm = p m {see rem 

for each charge Qi {evaluations, Sections |2.3| and \2.A\ 

| ^ “l - {mesh-to-charge assignment} 


We assume a collection of N charges Q = {Qi}^ =1 located at x 7 in a cube f2 = 
[0,L] 3 (see Figure lT2l) . A mesh with N e i = nj x n v el x nj, elements is constructed to 
conform to the domain, and a uniform mesh is assumed in each direction for simplicity 
of presentation — i.e., n e i = n^ = n v el = n^. Finally, the collocation points for g-order 
basis functions on the mesh are denoted x™, with j = 1,..., M = ( qn e \ + l) 3 . 

2.1. Charge assignment. A central component of particle-mesh methods is 
the assignment of singular charges to the mesh, yielding a mesh-based charge density 
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Fig. 2.2: Schematic configuration showing N charges of strength Qi at locations x? distributed in 
the cubic domain £1 of size hn e \ x hn e \ x hn e \, where h is the size of the cubic finite elements. The 
finite element centers are x and the collocation points are x™. 


function, p m (x m ). In particular, we seek an assignment function W(x) that reflects 
our specially selected screen functions and provides a weighting that distributes a 
charge Qi at x£ to each collocation point x™ of the basis functions: 

N 

(2.3) p m (x“) = Qi^W(x“;x^). 

i= 1 

Existing methods use Lagrange polynomials (PME [5]) or B-splines (P 3 M ;TB] and 
SPME 0) for this weighting, the latter of which work particularly well with FFTs. 
The charge assignment function impacts both accuracy and efficiency of the method. 
In our approach we design an assignment operator based directly on polynomial basis 
functions for compatibility with a finite-element-based Poisson solver. 

2.1.1. Defining the polynomial screens. We define our screen density func¬ 
tion for a single charge Qi £ Q as 

(2.4) pi(x) = yTjV’j(x), 

j 

with linear superposition providing the extension to multiple charges. Here iftj (x) are a 
collection of g-order Lagrange basis functions over an index set determined as follows. 
If charge Qi is located within element t 3 of the mesh, we choose V* = U r nr 3 - 5 £o' r 
to be the interpolation support of the charge assignment operator. That is, the 
support is the union of the element of the mesh that includes the charge along with 
all neighboring elements, leading to a support of 27 elements in three dimensions. 
Generalization to other choices for this support are briefly discussed in Section [5] To 
construct the polynomial screen, we consider the degrees of freedom which are interior 
to or on the faces of the element containing the charge. For g-order interpolating 
polynomials, this leads to dim(Vf J ) = (q + l) 3 degrees of freedom. These degrees of 
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freedom are determined so that the charge-screen combination has a potential that 
decays rapidly in space by considering the multipole expansion of the screen for a 
point well outside the screen, given by 
(2.5) 

$r(x) = i/ Pi(£)d£-i-/ MO «•*) d*+-l- f M0[3(^) 2 -I€| 2 ]d£+--- 

K Jv l Jv i Zil ' 5 Jv i 

p p p 

where R = (x — it l p )/h, with x representing the observation point, x^ is the center of 
the screen volume, and h is the mesh size. The quantity r is the unit direction vector 

R/|R|. 

For a charge Qi located at x,; = ( Xi,yi,Zi ) in element r, (see Figure liHil) . we denote 
the offset <5, = (<5f, Sf, 5f) = x, — xy with respect to the center of the element xy, and 
define the (l,m,n)- moment and centered (l,m,n)- moment of the screen function as 


(2.6) 

( l,m,n ) _ 1 

ri 1 

(x - 5?)\y - &i) m {z - 6~) n pi(x.) dx, 


Jv i 

(2.7) 

-( l.m.n ) / 

pi = / 

x l y m z n Pi (x) dx. 


where the origin is taken to be x®, the center of the screen volume V.]. Dividing 
p \ 0 ’°,°) = f vi pi (x) dx by R = |x,; — x |/h gives the first term of the screen’s multipole 

expansion from (12.511 . Thus, requiring p^ 0,0,0 ^ = 1 guarantees that the combined point- 
charge and screen have a potential that decays at least as fast as 1 / R 2 with distance 
from the point charge. Likewise, zeroing higher moments of the screen enforces the 
cancellation of dipole and higher-order terms and further accelerates the long-range 
decay rate, thereby reducing the number of interactions that must be explicitly rep¬ 
resented by point-to-point computations. With the available degrees of freedom, a 
screen of order q cancels all terms up to leaving $ sr = 1/R — $ sc ~ R~^ q+2 \ 

This is summarized in Table 12.11 which shows the moments that result from perform¬ 
ing the vector operations in the integrands of (12.51) . 


Power of R 

Single terms 

Mixed terms 

R- 1 

1 

— 

R~ 2 

x,y,z 

— 

R~ 3 

x 2 ,y 2 ,z 2 

xy, xz, yz 

R ~ 4 

x 3 ,y 3 ,z 3 

x 2 y, xy 2 , x 2 z, xz 2 , y 2 z, yz 2 , xyz 

R~ n 

x N ^ 1 ,y N ~ 1 1 z N ~ 1 

X l y m z n , with 

1 < l, to, n < N — 2, 
and l + m+ n = N — 1 


Table 2.1: Polynomial terms in multipole expansion. 


Constructing the screen. The goal is to perform multipole cancellations with 
screens that are also compatible with the basis functions of our finite element dis¬ 
cretization. With this description, each screen is composed of N sc = n^ c = (q + l) 3 
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nodal screen basis functions, ip: 

N S c — 1 

(2.8) Pi(x) = ^2 c 3 %( x )- 

3 =0 

Thus, revisiting (12.31) . the assignment operator W is 


(2.9) 


Mx”;x‘) = 


^ ^sc-1 

E c k (6i)ip k (xf) 

fc=0 

0 


x“ g y; 


otherwise. 


Restricting the first 1V SC moments leads to a iV sc x jV sc linear system for the coefficients 
c in (12.81) : 




^(0,0,0) . 

Vn sc -1 


Co 


1 

(2.10) 


v4 1,0,0) • 

.. W,(b0,0) 

XJV SC -1 


Cl 

= 

0 



lp[ q ’ g ’ q) • 

^IVsc-lJ 


_Cjv sc -1_ 


0 


where ipj d> m > n ) is the (l, m, n)-moment, as defined in (12.61) for p , of the j-th screen 
basis function ipj{x.). 

Figure 12)31 shows cross-sections of example screens constructed using q = 1,..., 4. 
Each screen’s peak is attained near the marked charge, and the screens are constructed 
to decay to zero at the edge of V p . The screens have support in the active screen region 
Vp and for q > 1, the screens are in general non-monotone. 

We confirm in Figure [2)4] that screens constructed in this fashion yield potentials 
with the expected behavior: in all cases, the far-field behavior approximates 1 /R. In 
Figure [231 we estimate mean and peak errors incurred for point-to-point interaction 
truncation at a distance R c = R c /h. To do this, the short-range potential is con¬ 
structed for N = 84 charge locations and sampled in 42 directions; the behavior of 
|$ sr | is shown in the figure as the weighted average |$ sr | avg and the maximum from all 
samples |$ sr | max . In addition, the radius from a charge location is also normalized as 
R = R/h, and is denoted R (see Figure l242l) . We see that twice the screen-size scale 
R » 3 corresponds to the expected start of the asymptotic decay behavior. This is 
the distance at which a multipole expansion is generally considered “well-separated” 
and expected to show convergence with R. Both the mean and peak errors show the 
expected behavior for increasing q beyond this distance. 

A fast algorithm for screen construction. Solving (12.101) directly requires 
O(N^) operations for each screen, which is feasible, but is not necessary in general. 
In the following, we design a fast algorithm for computing screens, which follows from 
a generalization of the Parallel Axis Theorem applied to moments used in the system. 
First, we note that the moments are additive. For example, the first moment in 
variable x of basis function ipj satisfies 


ip. 
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/ zV’(x) 
W P 

r(l,0,0) 


dx - 6? 


ip(x) dx 


= W- ’ u ’ u; - 
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(b) q = 2 
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Fig. 2.3: Example linear through quartic screens for a single marked charge. 




( a ) 


(b) 


Fig. 2.4: Screen potential in two directions: (a) x — Xj oc (1, 0,0) and (b) x — Xj oc (1,1,1). 


where tpj is the centered moment of the j-th basis function as in m ■ Therefore, 
second row of (12.1011 is equivalent to 


IVsc-l 

3 = 0 


JV.o-1 

= i; £ <*C” = 


3=0 


( 2 . 12 ) 
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Fig. 2.5: Maximum (solid) and average (dash-dot-dot) short-range potentials for screens with q = 1 
to 4. The straight dash-dot lines denote the expected slopes. 


Similarly, the second moment of xjjj satisfies 

^(2,0,0) _ f a; 2 ^( x ) dx _ 25 * f dx + (Sf ) 2 f ^(x) 

•U4 Jv„ Jvo 


dx 


(2.13) 


= - 2 5 x ^f m + 


so the third row of (12.101) becomes 

ATsc-l N ac -1 

r(2,o,o) — ST ..^(b 0 . 0 ) 


N S c~ 1 


14 , E E ^r o) - (*f) 2 E ^ 

1=0 1=0 i=o 

= 2(5f) 2 - (<5?) 2 = (Sff. 

Continuing this procedure for other moments in (12.101) yields: 


( 0 , 0 , 0 ) 


(2.15) 


4°-0-0) <’°’ 0) 
^( 1 , 0 , 0 ) ^( 1 , 0 , 0 ) 


4 q ’ m 

4°’ m 


A °’°’°) 

rN ac -l 


°-°) 

^iVsc-1 




V/V 3C -1 


Co 

Cl 


CiV BC -l 


(<5?) 9 

(*fWW) a 


which we write compactly as Cc = f. An advantage of this form is that for a uniform 
mesh, the matrix C is the same for each screen, since the moments reference the center 
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of the element. As a result, the matrix is pre-factorized leading to a complexity of 
only 0(Ng C ) to solve for each screen. 

For small q this yields a small operation count, yet the computation is further 
reduced if ^(x) is separable, as is the case for the regular cubic mesh shown in Fig¬ 
ure [2T2] In this case, 

(2.16) «/> K (x) = Ui(x)ujj{y)u} k {z), 

where uj, are the one-dimensional nodal basis functions for a mesh size h and k = 
i + (q + 1 )j + (q + 1 ) 2 fc with i,j, k £ [0, q\. With i[j separable, the moment integrals 
are also separable: 

# m ’ n) =/ x ; t/ ro AUx)dx 
Jv; 

( r3h/2 \ / f3h/2 \ / p3h/2 \ 

/ x l Ui{x)dx\ / y m u}j(y)dy) / z n uj k {z)dz . 
J~3h/2 J \J~3h/2 J \J-3h/2 J 

Following the notation of (12.711 . we define 

/ 3h/2 

x l uii(x) dx 

-3/1/2 

and likewise for the one-dimensional y and z centered moments. We take c K = 
wfWjiu^ and recognize that the right-hand side of (12.151) is also separable as = 
(S x ) 1 (S y ) m (5 z ) n with p = l + (q+l)m+(q + 1 ) 2 n. This yields three equivalent (q + l) 2 
systems of the form 

(2.19) J24\x)w* = (6 X ) 1 , l = 0,...,q 

2=0 

which are solved independently. The inverse of the matrix is computed once and 
applied for all right hand sides, resulting in only 0(n^ c ) operations per screen. This 
method also extends to regular rectangular meshes, where h. is not necessarily equal 
in each direction. 

2.2. Solution of the mesh potential. Given our construction of the screen p 
using the finite element basis functions via (HD, the solution of d* 8111 is straightfor¬ 
ward. We simply use a finite element solver with basis functions zt m of order p = q + 2, 
which results in a symmetric, positive definite sparse linear system (under realistic as¬ 
sumptions regarding the boundary conditions) that does not introduce any numerical 
approximations. 

Multigrid preconditioners are effective for this problem, even for high-order bases, 
and allow the sparse matrix problem to be solved to any level of accuracy. As an ex¬ 
ample, consider the case of a high-order finite element discretization of the Poisson 
problem with Dirichlet boundary conditions. Figure 12.61 shows the convergence his¬ 
tory of a multigrid preconditioned conjugate gradient method for basis functions of 
order p = 1 through 6. An algebraic multigrid preconditioner, based on smoothed ag¬ 
gregation using a more general strength measure |22| and optimal interpolation opera¬ 
tor [21], is used. We observe only a weak dependence on p. Moreover, more advanced 
multigrid techniques have shown still better scalings for both Poisson and other ellip¬ 
tic problems such as for Stokes flow HSlEQj. Importantly for our principal objective, 
multigrid preconditioners are well-known to exhibit high parallel efficiency |10l TJ. In 
the following tests, we use AMG through the BoomerAMG package m 
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Fig. 2.6: Convergence history for the Poisson problem using basis elements of order p = 1, ..., 6 for 
both the conjugate gradient method (dot-dashed) and multigrid preconditioning (solid). 


2.3. Evaluation of the smooth potential at charge locations. The next 
step is to evaluate <f> sm at the charge locations. For standard P 3 M and PME imple¬ 
mentations, this involves interpolation with the Lagrangian or B-spline basis functions 
from the charge assignment. In contrast, our method requires no interpolation, though 
interpolation can be used to speed up calculations, if desired. Since the smooth po¬ 
tential exists in each element as a linear combination of coefficients — i.e., the values 
of 4> sm at x m for all N m points in an element — and the basis functions u m , the 
smooth potential is expressed exactly at any point as 



( 2 . 20 ) 


where N m is the number of collocation points in an element. Direct evaluation at the 
charge locations x = is straightforward. 

2.4. Short-range potential. Our formulation for the exact mesh solution yields 
a more complex short-range interaction than PME. In addition to R , the short-range 
interaction now also depends on the position of the charge relative to the under¬ 
lying mesh. Consequently, additional effort is required to evaluate the short-range 
interaction. However, the calculation is local , so it does not inhibit parallel efficiency. 

The short-range potential at point x due to a charge Qi located at x, is 


( 2 . 21 ) 



where 


( 2 . 22 ) 
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Though feasible, performing accurate quadrature for each screen individually is com¬ 
putationally expensive. We therefore shift a significant portion of this computational 
effort to a pre-processing step, for which there are multiple options. 

One approach is to consider a look-up table of pre-computed values for the screen 
potential evaluated at x 3 due to a charge at xj). These values are represented in 
a six-dimensional look-up table as $ sc (x :; — x“ ; d, ), since they are a function of the 
difference between the evaluation point and the charge location, and also the offset of 
the charge within its element (which determines the screen). 

With some additional computation, but still without resorting to direct evaluation 
of (12.221) . it is possible to remove the charge offset interpolation to reduce errors. We 
accomplish this by recognizing the screen’s formulation as a linear combination of 
basis functions, 


(2.23) 

(2.24) 


Nsc-l 


*r( X ) = Qi E c *(*) 


'vi 


^•(0 

|x-*l 


d£ 


j=o 

Afso-l 

= Qi E Cj(5i)$E( X_X i)' 

j =0 


This approach yields N sc look-up tables for basis-function potential values $ b,sc (x :) — 
x£). However, for q > 2 the polynomial nature of the screen leads to non-monotonic 
decay for some directions within the region where the screen is active, as shown 
in Figure 12.41 Consequently, a direct implementation of a look-up table for such 
functions requires sufficient resolution, which is harder to achieve for larger q. For 
good performance, knowledge of the underlying structure of the screen potentials 
should be used to inform both the storage locations for the look-up table values and 
the interpolation method. 

2.5. A note about the self term. If the point x is the location of a charge, we 
do not wish to include the potential due to this charge in our calculation. However, 
we do still need to subtract the screen potential from the charge’s own screen, which is 
sometimes called the “self” term. We can allow for this by amending our short-range 
potential expression to include both cases: 


(2.25) 


$f(x) 


-*r( X ) 


|x - Xj| > 0 
otherwise. 


2.6. Alternate boundary conditions. The formulation above is presented 
under the assumption of periodic boundary conditions, which is the simplest case and 
important for a range of applications. It is straightforward to generalize boundary 
conditions via the mesh potential $ sm . This is accomplished by adjusting for short- 
range effects present at the boundary and then proceeding in the usual manner for a 
finite element problem with the given type of boundary conditions. For example, for 
a Dirichlet boundary condition of $ = g on dV, the condition for our mesh problem 
becomes 


= 3 - ^“lav, 


(2.26) 

which leads to 
(2.27) 


<b|oF = $ sm |av + <I> sr |av = g. 
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A similar approach is used in 53 . This also extends to the case of Neumann or mixed- 
type boundary conditions, with the usual constraint to address the non-uniqueness of 
the fully Neumann problem. Free-space conditions impose the usual challenges but 
are no more difficult for the proposed scheme than for any mesh-based Poisson solver. 

3. Performance model. The computational cost of the method for N charges 
is formulated as O(N) + O(M), where M = ( pn e \ + l) 3 is the total number of degrees 
of freedom in the mesh. Example CPU time scalings for the major TV-related com¬ 
ponents is illustrated in Figure 13. lal with the M -dependent mesh solve times shown 
in Figure 13. lbl Given N and M and assuming on average neighboring elements 
in the short-range interaction list for each charge, then it is possible to express the 
coefficients in the linear O(N) + 0{M) operation count in terms of p. Such a formu¬ 
lation provides a more detailed description of the actual costs of each component of 
the method and their relationships to the order of the screens. 



(a) Total CPU time vs. N for the main N- 
dependent components of the algorithm, in¬ 
cluding screen creation, short-range calcula¬ 
tion, basis function evaluation, and combi¬ 
nation of short-range and mesh potentials at 
charge locations. 



(b) CPU time to solve for mesh potential 
versus number of mesh points M. The num¬ 
ber of elements per coordinate direction is 
varied from 7 to 19. 


Fig. 3.1 : Total CPU time for (a) TV-related components, and (b) mesh solve (M-related). 


3.1. Breakdown of costs. Screens p^(x) of order q are built out of (q + l) 3 = 
N sc basis functions — recall that p = q + 2. The corresponding finite element solve 
associated with these screens involves (p + l) 3 degrees of freedom per element and 
a total of M degrees of freedom. We also define the average number of charges per 
clement as N = N/N e \. 

3.1.1. Screen construction. For each evaluation, the element containing each 
charge is identified, and the offsets from the center of these elements determined. This 
incurs a small 0(N ) cost, which we designate C\N. The screen coefficients are then 
calculated. As shown in (12.191) . assuming pre-computed inverses, this amounts to three 
matrix-vector multiplications of size q + 1 = p — 1, for a cost of 6(p — 1) — 3(p — 1). 
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We then multiply the one-dimensional weights, resulting in two additional floating 
point multiplications. The total cost for determining the screen coefficients is thus 

(screen construction) ~ [ 6(75 — l ) 2 — 3 (p — 1) + 2(p — 1) 3 ]7V. 

3.1.2. Short-range potential. The cost of evaluating the short-range poten¬ 
tial depends on the method chosen for calculating 4> sc , as discussed in Section 12.41 
In addition, there is a cost of O(N) due to the singular part of the short-range cal¬ 
culations, which we denote SN. For a general six-dimensional look-up table, the 
cost of calculating 4> sc at a point due to all charges in the short-range interaction 
volume is CqNN^, where 62 depends on the order of interpolation used. If N sc 
three-dimensional look-up tables are used, as we have done in the example calcula¬ 
tions of Section[4j then the interpolation is repeated for (q + l ) 3 tables and combined 
by an inner product with the screen coefficients and a multiplication by Qi for a total 
of [62 {jp — l ) 3 + 2 (p — 1 ) 3 ]NN*f. The cost for the short-range calculation is then 

(point-to-point evaluation) ~ SN + [ 62(75 — l ) 3 + 2 (p — 1) 3 ]NN^N. 

Since N = N/N e 1 , this expression is also written in terms of N 2 . However, we 
assume that in practice, iViVb r is chosen to be small enough to render this effectively 
as O(N). Furthermore, if TV is > 1, this cost is reduced further by calculating the 
effects of all charges in an element at once in an “element-to-point” operation. To do 
this, we compile a combined list of QiCi for all the charges in any given element, so 
that the screen potential for this sum at a point as calculated by (12.241) is the same 
as if the charges were handled individually. The cost then is then reduced by a factor 
of N yielding 

(element-to-point evaluation) ~ SN + [ 62(75 — l ) 3 + 2{p — l ) 3 — 1 ]N*fN. 

3.1.3. Mesh solve. The “transfer” of the order-g screens to a representation in 
order p = q+2 basis functions by (12.91) to construct the source p m in the right-hand side 
of the finite element solve (El requires an inner product between a vector containing 
the screen coefficients c(<5) with the evaluation of the order-g basis functions at the 
collocation points, followed by a multiplication by Qi. This is done at each degree of 
freedom within an active screen area, for a total of (3p + l ) 3 x [ 2 ( 75 — l)]iV operations. 
The multigrid solve for the finite element problem is 0(M ), with a coefficient C that 
depends on the convergence of the iterations, but is considered low in practice. Overall 
the mesh solve thus has complexity 

(mesh solve) ~ {(375 + l) 3 x [ 2(75 — l) 3 ]}iV + CM. 

3.1.4. Evaluation. The smooth potential is written as a combination of basis 
functions at the location of each charge, as in (12.201) . Thus evaluation involves (75 + l ) 3 
basis functions at a cost of 2p operations for each function. However, empirically we 
find that this cost is minimal in terms of CPU time. 

3.2. Summary. The screen creation (mostly due to the “transfer” portion) and 
short-range interaction calculations are the most costly even for modest values of 
75 given the scaling shown above. The relative costs of these two portions of the 
algorithm depend on choices in short-range calculation method, mesh size, and q. For 
any given cutoff error, decreasing mesh spacing decreases the number of short-range 
interactions, but results in an increased number of collocation points M in the mesh 
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solve. Likewise, increasing q also decreases the number of short-range interactions, 
but at the price of the increased cost of constructing and manipulating screens for 
larger q. Calculating the short-range effects of each individual charge becomes more 
costly with increased q , though at a slower rate than the transfer. The scaling of these 
components is shown in Figure lT2l for cases of N = 10 2 to 10 6 randomly distributed 
particles in a triply-periodic box with 6 859 elements. It is noted that once N 1, 
the singular short-range calculation loses its linearity in N. However, the screen 
potential portion of the short-range calculation retains its linearity due utilization of 
the “element-to-point” evaluation method. 




( a ) ( b ) 


Fig. 3.2: Example CPU time vs. N for the two most costly local portions of the algorithm: (a) 
creation of the screens, (b) calculation of short-range interactions for = 7x7x7. The dot-dash 
lines show the time associated with calculating singular potentials, while solid lines show times cal¬ 
culating element-to-point screen potentials. At large N, there is an expected breakdown in linearity 
for the singular potential calculations. 


4. Example calculation. We consider cases with N ranging from 10 2 to 10 5 
unit charges placed in a triply-periodic unit cube of elements with h = 0.067. The 
exact positions are selected randomly, but distributed so that any given charge ex¬ 
periences both long-range interactions, on the scale of the overall periodic domain 
size, and short-range interactions of comparable magnitude. This is done to provide a 
balanced test of both the short-range and smooth portions of our decomposition. To 
achieve this, the charges are randomly distributed within two smaller cubes: [ 0 , 1 / 2] 3 
is biased toward positive charges, 55% to 45%, and [1/2,1] is biased equally strongly 
toward negative charges. This set-up is visualized in Figure l4~Th for N = 100. 

The potential is then calculated using a short-range interaction of 7 x 7 x 7 = 343 
elements (corresponding to a minimum possible value of 3 for the cutoff distance 
R c ) for linear through quartic screens. This short-range cutoff is chosen to ensure 
that the short-range potential of every charge near the cutoff exhibits asymptotic 
behavior. The short-range calculation uses the approach of (12.2411 . with N sc look-up 
tables. These experiments are tested using a dual, quad-core Intel Xeon E5506 CPU 
with 48 GB of main memory. 

Remark 1. In our current implementation, we use a variation (but equivalent 
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form) to this construction, in which the values stored in the tables are for “basis 
screens” instead of screen basis functions. These basis screens, p basls ; are the polyno¬ 
mial screens associated with each node in an order-q finite element. The values of each 
table are computed as a Dirichlet finite element solution for Poisson’s equation, with 
—V 2 $j 6asl,s = p\ asls . The computation is completed in a domain larger than the size 
that will be kept in the look-up table to minimize boundary effects. Because these basis 
screens follow our moment-canceling rules, they have long-range decay ~ R~~ , and 
the boundary conditions are accurately set by the first terms of the multipole expansion 
The finite element solver uses basis functions of order p, and the look-up tables 
are stored in terms of their order-p basis functions, allowing them to be evaluated and 
combined in the same way as $ sm for all charge locations. We note that because the 
number of tables and coefficients is unchanged, the computational complexity for the 
short-range calculation is not altered by this variation. 

Upon calculation the potential is compared, allowing for a constant which is 
included in a potential and in this case is equal to the average value of $ sm throughout 
the computational domain, with that of an Ewald summation <f> E with large enough 
resolution that we consider it the “exact” solution. This uses two periodic images in 
physical space with a 2 = 6.25 and four modes for each direction in the Fourier sum. As 
we see for a representative calculation in Figure [OJ the method has super-algebraic 
convergence with q. 
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Fig. 4.1: Configuration of demonstration calculation: (a) Location of o positive and □ negative 
charges for N = 100, (b) <f> at each charge location for the 100 charge case, arranged by the charge’s 
location in x\ the red and blue lines mark the average potential value at locations in the positively- 
biased group and negatively-biased group, respectively. 


In each of these tests, we use an algebraic multigrid preconditioned GMRES solver 
with single precision residual tolerance - i.e., le — 7. Boomeramg |13] is used and 
the resulting method yields 8 or fewer iterations in each of the tests reported above. 
The timing dependence on mesh size is reported in Figure l3.1bl where we see that the 
solver exhibits 0(M) scaling. 

4.1. Estimated memory requirements. The memory requirements of the 
method in our example calculations are classified as finite element matrices or particle- 
related arrays. As the number of elements increases, the finite element matrices 
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Fig. 4.2: Convergence of the max and RMS measure of relative error in $ versus q for a sample 
case with N = 10 000. 


comprise a majority of the total allocated memory, as demonstrated in the following 
for the case of N = 10 6 : 


Order of 

screens 

Particle 

arrays 

FE matrices 
(N ei = 9 x 9 x 9) 

FE matrices 
(TVei = 25 x 25 x 25) 

linear (q = 1): 
quartic (q = 4): 

184 MB 
1120 MB 

39 MB (ft 18%) 
1290 MB (ft 52%) 

844 MB (ft 81%) 

27 600 MB (ft 95%) 


As the polynomial order increases — e.g. q = 4, which corresponds with a 6th order 
basis for the finite element solve — this effect increases as expected. We note that 
this memory footprint is typical for high-order FEM, but more optimal methods do 
exist [IS] . 

5. Discussion. 

5.1. Comparison to PME. While the method proposed here incorporates sev¬ 
eral advantageous features of PME, there are several notable differences that may offer 
benefits in certain settings. Our method no longer relies on the FFT, which may be 
limiting a extreme scales (in comparison to other Poisson solvers) and forces an as¬ 
sumption of structure on the compute geometry. The key is the introduction of a 
mesh-based screen, which introduces additional complexities locally, but also allows 
for a more general decomposition of the problem. There are particle-mesh variants 
that use finite elements — e.g., some PIC methods Eina — but these have been pro¬ 
posed with a symmetric screen, which must be resolved on the mesh. We avoid this ap¬ 
proximation, but at the cost of more intricate screen functions, which are constructed 
with (and the resulting potentials evaluated by) using memory-local operations. This 
fundamental difference hampers direct cost comparison with PME/P 3 M methods, 
which perform well when global FFTs do not impose restrictions. Still, we make 
some general comparisons in the following. 

The locality of the new method comes at the cost of more intricate screens, 
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which incur an 0(p e ) cost when represented by p-order basis functions as discussed in 
Section [3] This is larger than the 0(p 3 ) cost of the p-order B-spline interpolations in 
PME. However, the polynomial order p in the present scheme and the B-spline order 
p in PME are only loosely related. The B-spline order affects the overall accuracy of 
the PME method since it affects the resolution of the mesh description of the smooth 
potential. The polynomial order p in the present method does not, since the mesh 
solve is exact for any p. Instead, p affects R c via the decay of the screened potential 
as shown in Figure 1^31 This is important, since for uniform charge density the cost of 
point-to-point interactions scales with volume ~ R 3 . An independent Ewald splitting 
parameter sets the corresponding truncation error at fixed cut-off radius for PME. 

Similarly, the mesh density has different implications in the two methods. As 
with the B-spline order, the mesh density in PME affects the accuracy by providing 
more resolution for the potential. A denser mesh does not affect the short-range 
calculation, but requires more global communication for the FFT. In contrast, the 
mesh density in the present scheme decreases the communication burden for the short- 
range component of the calculation by reducing the number of interactions included 
for a given R c , since the cut-off radius is scaled by the mesh size, unlike in PME. The 
communication required of the mesh solver is that of multigrid. 

5.2. Comparison to FMM. The method presented in this paper shares sev¬ 
eral attractive features of the fast multipole method, most notably the linear scaling. 
The relative merits in comparison to FMM are likely application dependent, and the 
preferred choice depends on several factors. Though intricate, the low communica¬ 
tion burden of FMM leads to efficient implementations |19.. Both methods become 
expensive with increased p, the basis order in the present scheme or the multipole 
expansion order for FMM. Yet the highly local work load of the proposed high-order 
screens is more suitable for emerging architectures with accelerators. Unlike FMM, 
the present method is not naturally adaptive to larger regions without singularities — 
e.g., charges. The degree to which FMM takes advantage of this in parallel depends 
on load balancing issues of the system. 

5.3. Other considerations. For dynamic application, the conservation proper¬ 
ties of the overall scheme are important, such as conservation of energy in molecular 
dynamics simulations. Since we are only evaluating potentials in this paper, we do 
not consider momentum or energy conservation in detail. For the formulation as 
presented, the operators we demonstrate do not exactly satisfy the symmetry dis¬ 
cussed by Hockney and Eastwood m , so exact momentum conservation is not antici¬ 
pated. Moreover, as the basis functions are not differentiable at the collocation points, 
straightforward analytical differentiation of the potentials is not always possible. 

The nature of the mesh solve in the presented method lends itself to varied bound¬ 
ary conditions since the fundamental formulation of the algorithm does not change 
when the boundary conditions are changed. As with PME, periodic boundary condi¬ 
tions are the simplest to implement in our method, and require no extra effort beyond 
creating a finite element matrix that honors the periodic structure of the mesh. As 
presented in Section 12.61 Dirichlet and Neumann boundary conditions simply require 
calculations to allow for any short-range effects already present on the surface before 
applying the conditions to the finite element problem. 

The method is also extensible to non-uniform meshes common in finite element 
discretizations without fundamental changes. The main differences for general meshes 
is in the cost of the method. The screens are still built in the same way, that is, they 
still solve (12.101) . However, the discussed simplifications of the screen coefficient calcu- 
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lations depend on a regular, rectangular mesh and are not applicable to an unstruc¬ 
tured mesh with general quadrilateral elements. Thus, the flexibility of a complex 
mesh is balanced with the benefits of localizing the mesh cells. Likewise, the short- 
range potential becomes more difficult to generalize due to the many different shapes 
a screen could take based on the shapes of the elements composing it. Gaining accu¬ 
rate values for the short-range potentials may require quadrature-based evaluations 
for each pair of interacting charges. However, the locality and structured character 
of these operations is expected to coincide with high-throughput accelerators. 

In our demonstration, we have presented one choice for the support of the screens. 
Another possible variation of the method is to limit the screens to have support in 
only the element containing the charge, so that each screen includes only degrees of 
freedom interior to the element and not those on the faces. Using the same multipole 
representation to construct the screen, this choice results in a loss of two powers in the 
short-range decay of the screens — e.g., q = 3 for the screen yields a R~ 3 far-held decay 
instead of R ~ 5 , while of course still requiring p = 5 in the mesh solve (and all the cost 
incurred by this order of p ). However, the more compact screen provides an asymptotic 
decay rate starting at R ~ 1 instead of R ss 3 (see Figure ITSl) . which reduces the 
cost through reducing R c for certain target accuracies. The local composition of these 
one-element screens also facilitates the move to unstructured meshes, helping alleviate 
some of the additional cost in the screen-related calculations. 

In constructing our screens, we have chosen to maximize the far-held decay rate. 
Some simulation goals may be better served by other choices — e.g., by a weighted 
objective function. In such cases, a least-squares optimization might provide screens 
with advantageous properties to meet overall simulation objectives. We have also 
restricted our discussion to purely polynomial basis functions. Given the regularity of 
the underlying Green’s function, basis enrichments with specially designed functions 
chosen to increase the short-range decay likely enhance the overall performance of the 
method, though this also disrupts the exactness of the mesh solve. 
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